Abstract

Work  under  this  grant  focused  on  methods  for  extracting  hidden  information  from  network  data, 
including data from social networks, networks of communications and interactions, health or disease
networks,  and  brain  networks.  During  the  last  12  months  of  this  project,  the  funding  level  was  cut 
substantially.  Nevertheless,  over  this  period,  our  team  worked  on  several  substantial  projects,  including 
the  development  of  several  powerful  new  algorithms  for  analyzing  networks  and  their  application  to 
specific  real-world  domains.  These  efforts  produced  8  peer-reviewed  papers  or  new  preprints,  and  more 
than  a  dozen  invited  or  contributed  presentations  on  these  projects. 

We  continued  to  focus  on  developing  powerful  and  scalable  Bayesian  statistical  and  related  inference 
methods  for  community  structure,  hierarchies,  core-periphery  structure,  rankings,  and  other  large-scale 
network  structures,  and  on  discovering  the  fundamental  limits  of  these  techniques  for  inferring  such 
hidden  patterns.  Additionally,  we  focused  on  algorithms  applicable  to  very  large  networks,  networks  with 
auxiliary  information  (such  as  annotations,  temporal  dynamics,  or  edge  weights),  and  demonstrations  of 
these  techniques  to  domains  of  interest. 

In  particular,  we  discovered  new  phase  transitions  in  the  detectability  of  hidden  community  structure 
in both unsupervised and semisupervised settings, and in time-evolving networks; developed
message-passing approaches for models of recurrent epidemics on social networks; demonstrated how our
probabilistic models can help distinguish distinct edge formation mechanisms; developed a novel approach
to  identifying  hidden  core-periphery  structure  in  networks;  and  developed  a  new  algorithm  for  both 
identifying  and  characterizing  change  points  in  the  structural  patterns  of  temporal  networks. 

No  future  work  is  anticipated  for  this  project,  as  this  is  the  final  report  for  the  grant. 

Over  the  past  12  months,  we  made  further  progress  on  our  Phase  II  tasks  as  described  in  our  proposal 
and  our  2014  statement  of  work,  along  with  selected  ongoing  tasks  from  Phase  I  and  tasks  from  Phase  III 
for  which  we  made  early  progress. 

1.1 Incorporating auxiliary information

1.2 Scalability

1.4 Ranking in directed networks

1.5  Compensating  for  partial  or  noisy  data 

1.6  Guided  exploration  of  the  network 

1.7  Anomaly  detection  in  dynamic  networks 

Our progress on these tasks, detailed below, produced 8 peer-reviewed papers or new preprints,
as well as numerous colloquia, seminars, and conference presentations.

In  particular,  we  made  significant  progress  on  the  following  problems,  models,  and  algorithms: 

• Scalable algorithms for detecting statistically significant communities and hierarchical structure,
using techniques from spin glasses and message-passing for modularity.

• The performance of semisupervised learning in networks, which we discovered undergoes a phase
transition in accuracy when the amount of available metadata crosses a critical threshold.

•  Inferring  social  status  and  niche  status  in  social  and  ecological  networks. 

• Inference algorithms for identifying vertices with high centrality or importance, and hidden
“core-periphery” structure in networks.

•  Detecting  “change  points”  in  dynamic  networks,  where  new  structural  patterns  occur. 

•  Scalable  algorithms  for  detecting  community  structure  in  dynamic  networks,  along  with  the  discovery 
of  a  detectability  phase  transition  as  a  function  of  the  rate  of  change  and  the  strength  of  the  community 
structure. 

• Efficient modeling and prediction of epidemic models, including threshold models and recurrent
disease models such as SIS and SIRS.

1  Scalable  detection  of  statistically  significant  communities 

Moore,  postdoc  Zhang 
Research  Tasks  1.1, 1.3  and  1.5 

Modularity  is  a  popular  measure  of  assortative  community  structure,  which  measures  the  number  of  links 
internal  to  communities  as  compared  to  what  we  would  expect  from  a  null  model  where  the  network  is 
rewired  randomly.  However,  maximizing  the  modularity  can  lead  to  many  competing  partitions,  which  have 
almost  the  same  modularity  but  which  have  little  in  common  with  each  other;  it  can  also  overfit,  producing 
illusory  “communities”  in  random  graphs  where  none  exist.  We  address  this  problem  using  the  tools  of 
statistical  physics  and  spin  glass  theory.  Algorithmically,  we  use  the  modularity  as  a  Hamiltonian,  compute 
the  marginals  of  the  resulting  Gibbs  distribution,  and  assign  each  node  to  its  most-likely  community  under 
these  marginals.  In  contrast  to  the  “ground  state”  where  the  modularity  is  maximized,  which  is  prone  to 
overfitting,  this  marginal-based  partition  measures  statistically-significant  community  structure.  In  essence, 
it  represents  the  consensus  of  many  high-modularity  partitions,  as  opposed  to  the  single  “best”  one. 

We  have  derived  an  efficient  Belief  Propagation  (BP)  algorithm  to  compute  these  marginals,  with  a 
tunable  temperature  parameter.  In  random  networks  with  no  true  communities,  the  system  has  two  phases 
as  we  vary  the  temperature:  a  paramagnetic  phase  where  all  marginals  are  equal,  and  a  spin  glass  phase 
where  BP  fails  to  converge.  In  networks  with  real  community  structure,  there  is  an  additional  retrieval  phase 
where  BP  converges,  and  where  the  marginals  are  strongly  correlated  with  the  underlying  communities.  We 
show  analytically  and  numerically  that  our  algorithm  works  all  the  way  down  to  the  detectability  transition 
in  networks  generated  by  the  stochastic  block  model.  Our  algorithm  is  highly  scalable:  each  update  takes 
time  linear  in  the  number  of  edges  (with  a  prefactor  proportional  to  the  number  of  groups)  and  even  on  large 
networks  it  converges  rapidly. 
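
As a concrete illustration of the Gibbs-distribution idea, the following Python sketch computes the modularity Q of a partition and, by brute-force enumeration on a toy graph, the probability that two nodes share a group under P(t) proportional to exp(beta * m * Q(t)). It is not the BP algorithm described above: the function names, the toy graph, the value of beta, and the scaling of the exponent by the number of edges m are all assumptions made for illustration. Because the Gibbs distribution is symmetric under relabeling of groups, exact single-node marginals are uniform, so the sketch reports co-membership probabilities instead.

```python
import itertools
import math

def modularity(edges, labels):
    """Newman-Girvan modularity Q of a partition of an undirected simple graph."""
    m = len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    internal = {}   # number of edges inside each group
    group_deg = {}  # total degree of each group
    for u, v in edges:
        if labels[u] == labels[v]:
            internal[labels[u]] = internal.get(labels[u], 0) + 1
    for node, k in degree.items():
        group_deg[labels[node]] = group_deg.get(labels[node], 0) + k
    return sum(internal.get(g, 0) / m - (d / (2 * m)) ** 2
               for g, d in group_deg.items())

def comembership(edges, n, q, beta):
    """Exact P(t_i == t_j) under P(t) proportional to exp(beta * m * Q(t)).

    Enumerates all q**n labelings, so it is only usable on tiny graphs.
    """
    m = len(edges)
    total = 0.0
    same = [[0.0] * n for _ in range(n)]
    for labels in itertools.product(range(q), repeat=n):
        weight = math.exp(beta * m * modularity(edges, labels))
        total += weight
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    same[i][j] += weight
    return [[s / total for s in row] for row in same]

# Toy graph (hypothetical): two triangles joined by the single edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
q_two_triangles = modularity(edges, [0, 0, 0, 1, 1, 1])
P = comembership(edges, n=6, q=2, beta=2.0)
print(q_two_triangles, P[0][1], P[0][5])
```

Nodes in the same triangle come out with a high co-membership probability and nodes in different triangles with a low one, which is the "consensus of many high-modularity partitions" in miniature. BP replaces the exponential enumeration with local message updates.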

By applying our algorithm recursively, subdividing communities until no statistically significant
subcommunities can be found, we can detect hierarchical structure in real-world networks more efficiently than
previous methods. For instance, applied to the popular network data set of political blogs, our algorithm first
finds two large communities corresponding to liberals and conservatives, agreeing with the ground-truth
labels on 95% of the nodes. Beyond this top-level split, it also finds subcommunities at multiple scales.

We  have  tested  our  algorithm  against  many  other  popular  techniques,  including  Louvain,  Infomap,  and 
OSLOM,  and  it  significantly  outperforms  them.  We  also  found  large  statistically-significant  communities  in 
several  networks  where  previous  literature  had  claimed  they  do  not  exist. 

Since the last report, this work has been extended and published.

• P. Zhang and C. Moore, “Scalable detection of statistically significant communities and hierarchies:
message-passing for modularity.” Proc. Natl. Acad. Sci. USA 111 (51), 18144-18149 (2014).
Preprint: http://arxiv.org/abs/1403.5787

2  Phase  transitions  in  semisupervised  learning  in  networks 

Moore,  postdoc  Zhang,  collaborator  Zdeborova 
Research  Tasks  1.1, 1.5  and  1.6 

In  many  networks,  nodes  have  attributes  or  metadata.  If  these  are  known  for  some  subset  of  the  nodes,  we 
would  like  to  be  able  to  make  good  guesses  about  the  others,  filling  in  our  missing  information  about  the 
network.  Our  belief  propagation  algorithms  give  a  scalable  and  mathematically  principled  way  to  do  this, 
in  time  nearly-linear  in  the  network  size.  Starting  with  the  known  labels,  we  propagate  the  probabilities  of 
different  labels  to  the  rest  of  the  network,  until  we  reach  a  fixed  point. 
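
The idea of propagating label probabilities from known nodes until a fixed point is reached can be sketched with a much simpler stand-in than BP: plain label propagation, in which each unlabeled node repeatedly takes the average of its neighbors' label distributions while labeled nodes stay clamped. This is not the belief propagation algorithm of this project; the function name, the adjacency-list format, and the fixed iteration count are assumptions chosen for illustration.

```python
def propagate_labels(adj, known, n_classes, n_iter=200):
    """Clamped label propagation on an undirected graph.

    adj: list of neighbor lists, adj[v] = neighbors of node v.
    known: dict mapping a node to its known class index.
    Returns one probability vector over classes per node.
    """
    n = len(adj)
    probs = []
    for v in range(n):
        if v in known:
            probs.append([1.0 if c == known[v] else 0.0 for c in range(n_classes)])
        else:
            probs.append([1.0 / n_classes] * n_classes)
    for _ in range(n_iter):
        new_probs = []
        for v in range(n):
            if v in known or not adj[v]:
                # Known labels stay clamped; isolated nodes keep their prior.
                new_probs.append(probs[v])
            else:
                new_probs.append([
                    sum(probs[u][c] for u in adj[v]) / len(adj[v])
                    for c in range(n_classes)
                ])
        probs = new_probs
    return probs

# Path graph 0-1-2-3-4 with the two end nodes labeled (hypothetical example).
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
probs = propagate_labels(adj, known={0: 0, 4: 1}, n_classes=2)
print([round(p[0], 3) for p in probs])
```

On this path the fixed point interpolates linearly between the two labeled ends. Unlike BP on the stochastic block model, this averaging scheme uses no model of how groups connect, so it should be read only as a picture of the propagation step.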

We  have  found  that  there  is  often  a  sudden  jump  in  the  accuracy  we  can  achieve,  as  a  function  of  the 
fraction of nodes for which the labels are known. When this fraction is too small, we have only local
information, about the known nodes and their neighbors. But when this fraction crosses a critical threshold,
our knowledge becomes global, percolating throughout the entire network and letting us accurately predict
almost all the nodes. Using the cavity method of statistical physics, we have given an efficient BP
algorithm for propagating this information, and computed the critical threshold for networks generated by the
stochastic  block  model.  With  a  simplified  analysis,  we  can  determine  the  threshold  exactly,  showing  how  the 
existence  of  multiple  fixed  points  causes  a  first-order  transition  where  the  accuracy  jumps  discontinuously. 
We  have  found  qualitatively  similar  behavior  in  real  networks. 

Since  the  last  report,  the  final  paper  on  this  topic  was  completed  and  subsequently  published. 

•  P.  Zhang,  C.  Moore,  and  L.  Zdeborova,  “Phase  transitions  in  semisupervised  clustering  of  sparse 
networks.”  Physical  Review  E  90,  052802  (2014). 

Preprint:  http://arxiv.org/abs/1404.7789 

3  Probabilistic  models  of  status  and  ranking 

Clauset and Moore, grad student Jacobs, collaborator Dunne
Research  Tasks  1.4  and  1.6 

Characterizing  the  structural  roles  of  vertices  in  networks  remains  an  important  open  problem,  with  most 
work focusing on “centrality” heuristics that often have uncertain theoretical relevance to particular
networks. In this work, we adapt a flexible probabilistic model of vertex position or status, originally developed
in  mathematical  ecology  as  a  model  of  food  webs  and  called  the  probabilistic  niche  model,  to  the  task  of 
characterizing  the  positional  structure  of  vertices  within  arbitrary  networks. 

We  apply  a  powerful  multi-temperature  Monte  Carlo  technique  from  statistical  physics  called  parallel 
tempering for sampling and optimizing the posterior distribution of models. This technique has not
previously been applied to estimating network models, but provides excellent results in this case. We then
use  this  technique  to  address  an  outstanding  problem  in  the  theory  of  food  webs,  namely  whether  parasites 
and  free-living  predators  play  different  structural  roles.  While  ecological  networks  are  not  currently  an 
area of DARPA concern, we regard this as a valuable test bed for developing techniques and algorithms that
can address analogous problems in social, technological, and brain networks: specifically, to test for
statistically significant heterogeneities in network structure, based on physical location, social status, or other
continuous-valued correlates.
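
For readers unfamiliar with parallel tempering, the following Python sketch shows the technique on a toy one-dimensional target rather than on a network model. Several replicas run Metropolis updates at different inverse temperatures, and neighboring replicas occasionally swap states, which lets the cold replica escape local modes. The double-well energy function, the temperature ladder, the step size, and all names are assumptions for illustration only; in the actual application the state would be a set of model parameters and the energy a negative log-posterior.

```python
import math
import random

def energy(x):
    """Toy double-well energy with minima at x = -1 and x = +1."""
    return 8.0 * (x * x - 1.0) ** 2

def parallel_tempering(betas, n_steps, step_size=0.5, seed=0):
    """Return the trajectory of the replica at betas[0] (the coldest one)."""
    rng = random.Random(seed)
    states = [1.0] * len(betas)  # every replica starts in the right-hand well
    cold_samples = []
    for _ in range(n_steps):
        # One Metropolis update per replica at its own inverse temperature.
        for i, beta in enumerate(betas):
            proposal = states[i] + rng.uniform(-step_size, step_size)
            delta = energy(proposal) - energy(states[i])
            if delta <= 0 or rng.random() < math.exp(-beta * delta):
                states[i] = proposal
        # Propose swapping a random pair of neighboring replicas.
        i = rng.randrange(len(betas) - 1)
        log_ratio = (betas[i] - betas[i + 1]) * (energy(states[i]) - energy(states[i + 1]))
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            states[i], states[i + 1] = states[i + 1], states[i]
        cold_samples.append(states[0])
    return cold_samples

samples = parallel_tempering([1.0, 0.5, 0.25, 0.1], n_steps=20000)
print(min(samples), max(samples))
```

The swap acceptance rule exp((beta_i - beta_j)(E_i - E_j)) preserves the joint distribution of all replicas, so the cold chain still samples its own target while borrowing the mobility of the hot chains.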

We use three tests: (i) calculating goodness-of-fit measures for learning within-class edges when the
model is shown only a single class or both classes of vertices together, (ii) comparing network
structural statistics between the empirical network and networks generated from the learned models, and
(iii) calculating link-prediction accuracy when the model is shown only a single class or both classes
together. By measuring AUC statistics for link prediction, we are able to assess which types of links are
well fit by the model, and which it considers anomalous.
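
The AUC statistic for link prediction has a simple pairwise interpretation: it is the probability that a randomly chosen true link receives a higher model score than a randomly chosen non-link, with ties counted as one half. A minimal Python sketch of that definition follows; the function name and the example scores are hypothetical.

```python
def link_prediction_auc(scores_true, scores_false):
    """AUC as the fraction of (true link, non-link) pairs ranked correctly."""
    wins = 0.0
    for s in scores_true:
        for t in scores_false:
            if s > t:
                wins += 1.0
            elif s == t:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(scores_true) * len(scores_false))

# Hypothetical model scores for held-out true links and for non-links.
auc = link_prediction_auc([0.9, 0.8], [0.1, 0.85])
print(auc)  # 3 of the 4 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random guessing, so computing it separately for each type of link shows which types the model fits well and which it treats as anomalous.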

Since the last report, this paper has been submitted to a journal and is currently being revised for
resubmission.

• A. Z. Jacobs, C. Moore, J. A. Dunne and A. Clauset, “Untangling the roles of parasites in food webs
with generative network models.” Submitted (2015).
Preprint: http://arxiv.org/abs/1505.04741

4  Identifying  core-periphery  structure  in  networks 

Newman,  graduate  students  Zhang  and  Martin 
Research  Tasks  1.2  and  1.4 

Many networks can be usefully decomposed into a dense core plus an outlying, loosely connected
periphery. Here we propose an algorithm for performing such a decomposition on empirical network data using
methods  of  statistical  inference.  Our  method  fits  a  generative  model  of  core-periphery  structure  to  observed 
data  using  a  combination  of  an  expectation-maximization  algorithm  for  calculating  the  parameters  of  the 
model  and  a  belief  propagation  algorithm  for  calculating  the  decomposition  itself.  We  found  our  method  to 
be efficient, scaling easily to networks with a million or more nodes. We tested it on a range of networks,
including real-world examples as well as computer-generated benchmarks, for which it successfully identifies
known core-periphery structure with a low error rate. We also demonstrated that the method is immune to
the detectability transition observed in the related community detection problem, which prevents the
detection of community structure when that structure is too weak. There is no such transition for core-periphery
structure, which is detectable, albeit with some statistical error, no matter how weak it is.

• T. Martin, X. Zhang, and M. E. J. Newman, “Identification of core-periphery structure in networks.”
Physical Review E 91, 032803 (2015).
Preprint: http://arxiv.org/abs/1409.4813

5  Detecting  statistically  significant  structural  changes  in  evolving  networks 

Clauset,  postdoc  Peel 
Research  Tasks  1.1  and  1.7 

Relational  variables  among  objects  or  people  are  a  common  form  of  data,  and  networks  provide  a  general 
framework  through  which  to  quantify  and  analyze  their  patterns.  For  example,  online  social  interactions, 
offline  friendships,  and  object-user  interactions  may  all  be  represented  as  networks.  In  many  cases,  these 
relations  are  dynamic,  and  their  evolution  over  time  may  be  represented  as  a  sequence  of  networks,  each 
giving  the  interactions  among  a  common  set  of  vertices  at  consecutive  points  in  time. 

A  key  task  in  detecting  anomalies  in  evolving  networks  is  change-point  detection,  in  which  we  (1) 
identify  moments  in  time  across  which  the  large-scale  pattern  of  interactions  changes  fundamentally  and 
(2)  quantify  what  kind  and  how  large  a  change  occurred.  Identifying  the  timing  and  shape  of  such  change 
points  divides  a  network’s  evolution  into  contiguous  periods  of  relative  structural  stability,  allowing  us  to 
subsequently analyze each period independently, while also providing hints about the underlying processes
shaping  the  data. 

We  model  the  network  structure  with  a  novel  generative  model  that  generalizes  the  classic  hierarchical 
random graph model (previously developed by Clauset, Moore and Newman, Nature 2008). This GHRG
model  compactly  captures  nested  community  structure  at  all  scales  in  a  network.  We  then  use  this  model  in  a 
classic  change-point  detection  framework,  in  which  we  compare  a  change  versus  a  no-change  model  within 
a  sliding  window  on  the  network  time  series,  via  a  generalized  likelihood  ratio  test,  in  order  to  identify 
statistically  significant  change  points. 
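
The change-versus-no-change comparison can be sketched in Python using a far simpler network model than the GHRG: here every possible edge in a snapshot is present independently with a single probability (an Erdős-Rényi model), and a change point is a shift in that probability. The function names, the model substitution, and the example window are assumptions for illustration. The sketch returns only the test statistic; deciding significance would require a threshold, for instance one calibrated by simulating from the no-change fit.

```python
import math

def bernoulli_loglik(k, n):
    """Maximized log-likelihood of k successes in n Bernoulli trials."""
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log(p) + (n - k) * math.log(1.0 - p)

def glr_change_point(edge_counts, n_pairs):
    """Generalized likelihood ratio test within one window of snapshots.

    edge_counts[t] is the number of edges in snapshot t and n_pairs is the
    number of possible edges per snapshot. Returns (best split index, statistic),
    where a split at s means snapshots [0, s) and [s, T) have different densities.
    """
    T = len(edge_counts)
    ll_null = bernoulli_loglik(sum(edge_counts), T * n_pairs)
    best_split, best_stat = None, 0.0
    for s in range(1, T):
        left = sum(edge_counts[:s])
        right = sum(edge_counts[s:])
        ll_alt = (bernoulli_loglik(left, s * n_pairs)
                  + bernoulli_loglik(right, (T - s) * n_pairs))
        stat = 2.0 * (ll_alt - ll_null)
        if stat > best_stat:
            best_split, best_stat = s, stat
    return best_split, best_stat

# Hypothetical window: 8 snapshots on 10 nodes (45 possible edges each),
# with the edge density jumping after the fourth snapshot.
split, stat = glr_change_point([2, 3, 2, 3, 9, 10, 9, 10], n_pairs=45)
print(split, stat)
```

Sliding this window along the time series and repeating the test gives a sequence of candidate change points. Replacing the single density with a richer model is what lets the approach detect changes that leave summary statistics such as mean degree untouched.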

Previous approaches to change-point detection in networks involved representing the sequence of
networks as a time series of summary statistics (e.g., mean degree, clustering coefficient, mean geodesic
distance). Although these methods are useful for detecting some types of change points, we demonstrate
that they fail to detect changes up to 80% of the time, even when the structural change is large enough that
our approach detects it 100% of the time.

We present a taxonomy of different types and sizes of network change points and a quantitative
characterization of the difficulty of detecting them, in synthetic network data with known change points. We then
test  the  method  on  three  real,  high-resolution  evolving  social  networks  of  physical  and  digital  interactions 
(the  Enron  email  corpus,  MIT  Reality  Mining  physical  proximity  networks,  and  Hypertext  2009  physical 
proximity network), showing that it accurately recovers the timing of known significant external events.

Since  the  last  report,  this  paper  has  been  published  and  we  have  begun  applying  the  technique  to  new 
sources  of  social  and  epidemiological  network  data. 

• L. Peel and A. Clauset, “Detecting change points in the large-scale structure of evolving networks.”
Proc. of the 29th International Conference on Artificial Intelligence (AAAI), 2914-2920 (2015).
Preprint: http://arxiv.org/abs/1403.0989
Code: http://gdriv.es/letopeel/proximitynetwork.html

6  Detectability  thresholds  and  optimal  algorithms  for  community  structure 
in  dynamic  networks 

Clauset  and  Moore,  grad  student  Ghasemian,  postdocs  Zhang  and  Peel 
Research  Tasks  1.1, 1.2,  and  1.7 

Community  detection  in  dynamic  networks  inherits  many  of  the  challenges  of  community  detection  in  static 
networks,  including  learning  the  number  of  communities,  their  sizes  and  node  membership,  and  the  pattern 
of  connections  among  communities,  e.g.,  assortative,  disassortative,  core-periphery,  etc.  It  also  poses  new 
challenges,  because  both  the  network  edges  and  the  community  memberships  may  evolve  over  time.  A 
common  approach  is  to  simply  take  the  union  of  dynamic  graphs  over  a  certain  time  window,  and  treat 
the  resulting  graph  with  techniques  from  static  network  analysis,  thereby  ignoring  the  dynamics  within  the 
window.  In  this  project,  we  explicitly  model  the  dynamic  nature  of  these  networks  and  the  way  community 
memberships  change  over  time,  integrating  information  about  the  communities  in  an  optimal  way. 

Here,  we  study  the  fundamental  limits  on  learning  latent  community  structure  in  dynamic  networks. 
Specifically, we study dynamic stochastic block models where nodes change their community membership
over  time,  but  where  edges  are  generated  independently  at  each  time  step.  In  this  setting  (which  is  a  special 
case  of  several  existing  models),  we  are  able  to  derive  the  detectability  threshold  exactly,  as  a  function  of 
the  rate  of  change  and  the  strength  of  the  communities.  Below  this  threshold,  we  claim  that  no  algorithm 
can  identify  the  communities  better  than  chance.  We  then  give  two  algorithms  that  are  optimal  in  the 
sense that they succeed all the way down to this limit. The first uses belief propagation (BP), which gives
asymptotically optimal accuracy, and the second is a fast spectral clustering algorithm, based on linearizing
the  BP  equations  along  the  lines  of  previous  work  under  this  grant.  We  verify  our  analytic  and  algorithmic 
results  via  numerical  simulation. 
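
To make the setting concrete, the following Python sketch generates a sequence of snapshots from a dynamic stochastic block model of the kind described above: memberships evolve over time, and edges are drawn afresh and independently at each time step. The specific membership rule (each node keeps its group with probability eta and otherwise picks a group uniformly at random), the parameter names, and the function name are assumptions for illustration.

```python
import random

def dynamic_sbm(n, k, T, p_in, p_out, eta, seed=0):
    """Sample T snapshots of a dynamic stochastic block model.

    n nodes, k groups. Within-group pairs are linked with probability p_in,
    between-group pairs with probability p_out. Returns (label_history, snapshots),
    where snapshots[t] is a list of edges (i, j) with i < j.
    """
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in range(n)]
    label_history, snapshots = [], []
    for t in range(T):
        if t > 0:
            # Assumed rule: keep the old group with probability eta,
            # otherwise draw a new group uniformly at random.
            labels = [g if rng.random() < eta else rng.randrange(k) for g in labels]
        edges = [
            (i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if rng.random() < (p_in if labels[i] == labels[j] else p_out)
        ]
        label_history.append(list(labels))
        snapshots.append(edges)
    return label_history, snapshots

history, snaps = dynamic_sbm(n=30, k=2, T=5, p_in=0.3, p_out=0.05, eta=0.9)
print([len(e) for e in snaps])
```

With eta close to 1 the memberships are persistent and information can be pooled across snapshots, while with eta near 1/k each snapshot is nearly independent, which is the trade-off between rate of change and community strength that the detectability threshold quantifies.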

•  A.  Ghasemian,  P.  Zhang,  A.  Clauset,  C.  Moore  and  L.  Peel,  “Detectability  thresholds  and  optimal 
algorithms  for  community  structure  in  dynamic  networks.”  Submitted  (2015). 

Preprint: http://arxiv.org/abs/1506.06179

7  Efficient  prediction  of  epidemic  models  and  threshold  models 

Moore,  grad  student  Shrestha,  collaborator  Scarpino 
Research Task 1.2 and Use Cases

Given  an  epidemic  model,  we  would  like  to  calculate  the  probability  that  a  given  node  becomes  infected, 
given  a  particular  initial  condition.  This  initial  condition  might  consist  of  a  set  of  infected  individuals,  or  the 
set  of  initial  adopters  of  a  behavior  or  trend;  it  might  also  be  probabilistic,  assigning  each  individual  some 
initial  probability  of  being  infected.  These  models  are  stochastic:  each  individual  passes  the  infection  on  to 
its  neighbors  at  a  certain  rate,  which  we  model  as  a  Markov  process  in  continuous  time.  The  straightforward 
way  to  compute  these  probabilities  is  to  simulate  the  model  many  times,  performing  many  independent  runs. 
However,  this  is  computationally  expensive,  especially  if  we  also  want  to  sweep  the  space  of  parameters 
or  initial  conditions:  for  instance,  to  fit  the  model  to  real-world  epidemic  data,  or  to  search  for  effective 
intervention  strategies. 
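
The straightforward Monte Carlo baseline can be sketched as follows in Python. For brevity the sketch uses a discrete-time SIS model, in which each infected neighbor transmits with a fixed probability per step and each infected node recovers with a fixed probability per step, rather than the continuous-time Markov process described above; that simplification, the parameter names, and the function names are all assumptions for illustration.

```python
import random

def simulate_sis(adj, seeds, infect_p, recover_p, n_steps, rng):
    """Run one discrete-time SIS simulation and return the final infected set."""
    infected = set(seeds)
    for _ in range(n_steps):
        next_infected = set()
        for v in range(len(adj)):
            if v in infected:
                # An infected node stays infected unless it recovers this step.
                if rng.random() >= recover_p:
                    next_infected.add(v)
            else:
                # Each infected neighbor independently tries to transmit.
                for u in adj[v]:
                    if u in infected and rng.random() < infect_p:
                        next_infected.add(v)
                        break
        infected = next_infected
    return infected

def infection_probabilities(adj, seeds, infect_p, recover_p, n_steps,
                            n_runs=1000, seed=0):
    """Estimate P(node is infected after n_steps) from independent runs."""
    rng = random.Random(seed)
    counts = [0] * len(adj)
    for _ in range(n_runs):
        for v in simulate_sis(adj, seeds, infect_p, recover_p, n_steps, rng):
            counts[v] += 1
    return [c / n_runs for c in counts]

# Hypothetical example: a path of three nodes with node 0 initially infected.
path = [[1], [0, 2], [1]]
estimate = infection_probabilities(path, seeds=[0], infect_p=0.3,
                                   recover_p=0.1, n_steps=5)
print(estimate)
```

Every change to the parameters or to the initial condition requires a fresh batch of runs, and the sampling error shrinks only as the inverse square root of the number of runs, which is the cost that a message-passing calculation avoids.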

Following Karrer and Newman (Phys. Rev. E 82, 016101 (2010)), we developed a dynamic
message-passing (DMP) approach to this problem that also applies to threshold models of behavior popular in
sociology: these are models, first proposed by Granovetter, where each individual has to hear about a trend or
behavior from some number of neighbors before adopting it themselves. This work appeared in our previous
report.

Since  then,  we  have  successfully  extended  this  work  to  more  complex  epidemic  models,  and  in  particular 
“recurrent” models such as SIS and SIRS, where multiple waves of infection can pass through the
population. In these models, the method of Karrer and Newman cannot be applied directly, since nodes can return
to  previous  states  (such  as  when  an  infected  individual  becomes  susceptible  again).  We  have  developed  a 
new  class  of  differential  equations  for  these  models.  Our  methods  are  much  faster  than  direct  simulation, 
and  they  are  also  more  efficient  than  the  so-called  “pair  approximation”  currently  used  in  epidemiology, 
especially  for  complex  epidemic  models  where  individuals  have  multiple  states. 

• M. Shrestha, S. Scarpino, and C. Moore, “Message-passing approach for models of recurrent
epidemics in networks.” Physical Review E 92, 022821 (2015).

Preprint:  http://arxiv.org/abs/1505.02192 